packages <- c('tidyverse', "vembedr")
#invisible(lapply(packages, install.packages, character.only = TRUE)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
set.seed(1234) #set seed for reproducibility
Welcome to tutorial 1 of COSMOS Konstanz. These code notebooks are designed to be self-contained learning tools for building on the concepts taught during the summer school, but they should also be accessible to anyone around the world interested in these topics.
The code notebooks introduce theoretical concepts, provide basic mathematical formulas, and offer helpful code snippets that you can use in your own projects. There are also links to demonstrations of simple online experiments and interactive Shiny apps that let you play around with different experimental paradigms, computational models, and simulation frameworks. The notebooks also contain exercises, which offer suggestions for how to go beyond the materials.
Multi-armed bandit problem
Our first goal is to design a simple learning environment for studying social learning and collective behavior. We start with a multi-armed bandit problem (Robbins, 1952; Steyvers et al., 2009; Wald, 1947) as one of the simplest tasks commonly used to study learning in humans, animals, and artificial agents.
The term “multi-armed bandit” is simply a colorful metaphor for a row of slot machines in a casino. Since slot machines are notorious for stealing your money, a slot machine is sometimes known as a “one-armed bandit”, in reference to the lever used to play the machine.
Each of the \(k \in [1,\ldots, K]\) slot machines refers to one of \(K\) different available decisions or options, each yielding a reward from an initially unknown payoff function. We can easily map this metaphor onto a wide range of real-world learning problems, such as choosing what to order at a restaurant, where to forage for food, or which crops to plant. The goal is thus to maximize your payoffs by sequentially playing one of the options at each time \(t\).
In order to do so, one must balance exploring new options to acquire information about payoffs (to inform future decisions) against exploiting options known to have high payoffs (to acquire immediate rewards). This is known as the exploration-exploitation dilemma (Mehlhorn et al., 2015), and every living organism capable of learning must strike an efficient balance between these dual goals. Yet, given a finite time horizon for learning, optimal solutions are generally unavailable (Aplin et al., 2015; Deffner et al., 2020; Sasaki et al., 2013).
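As a preview of how an agent might navigate this trade-off, here is a minimal sketch of one common heuristic, an epsilon-greedy choice rule (this is just one possible strategy, not necessarily the one used later in this tutorial; the names Q and epsilon are illustrative): with probability epsilon the agent explores a random option, and otherwise it exploits the option with the highest current value estimate.
epsilonGreedy <- function(Q, epsilon = 0.1) { #illustrative sketch; Q is a vector of current value estimates, one per option
  if (runif(1) < epsilon) {
    return (sample(seq_along(Q), size = 1)) #explore: pick a random option
  } else {
    return (which.max(Q)) #exploit: pick the option with the highest estimated value
  }
}
epsilonGreedy(Q = c(0.5, 2.1, -1.0)) #usually returns option 2, but occasionally explores another option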
2-armed Gaussian bandits
Let’s start from the simplest case and program a 2-armed bandit with Gaussian rewards. Imagine you are a farmer who can choose to plant one of two crops: wheat or potatoes. At the end of the growing season, you earn a reward proportional to how well your crop performed. However, one of the crops may produce reliably better yields, and it is up to you to figure out which crop is better.
You may experience better or poorer yields depending on your choice of crop, as well as some random noise from the environment.
Year after year, there may be variations in yield for a given crop, which we can model using Gaussian noise. Thus, we can describe the reward for option \(k\) as a sample from a Gaussian distribution (also known as a Normal distribution, hence the fancy \(\mathcal{N}\)):
\[ r_k \sim \mathcal{N}(\mu_k, \sigma_k^2) \]
We can now describe a payoff function for each option in terms of a mean or expected reward \(\mu_k\) and the variability \(\sigma_k^2\) (expressed as a variance). Note that the variability can also be expressed in terms of a standard deviation as \(\sigma_k = \sqrt{\sigma_k^2}\).
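To make this concrete, a single reward for option \(k\) can be drawn with rnorm(). One detail worth noting is that rnorm() expects a standard deviation, while the formula above is written in terms of a variance; the mean and variance values below are just placeholders for illustration.
mu_k <- 5 #expected reward of option k (placeholder value)
sigma2_k <- 4 #variance of option k's payoffs (placeholder value)
r_k <- rnorm(n = 1, mean = mu_k, sd = sqrt(sigma2_k)) #rnorm() takes the standard deviation, not the variance
r_k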
Generative model of the environment
A generative model of the environment takes a choice as an input and returns a payoff as an output.
Let’s now program a generative model of the bandit task. By specifying the number of options and providing the expected reward and variability for each option, we can define a function that takes an action and returns a reward. We provide the function in the code block below and generate payoffs for 25 random choices.
K = 2 #number of options
meanVec <- runif(n=K, min = -10, max = 10) #Payoff means, which we sample from a uniform distribution between -10 and 10
sigmaVec <- rep(1, K) #Variability of payoffs (as a standard deviation), which we assume is the same for all options.
banditGenerator <- function(k) {#k is an integer or a vector of integers, selecting one of the 1:K arms of the bandit
payoff <- rnorm(n = length(k), mean = meanVec[k], sd = sigmaVec[k])
return (payoff)
}
actionSequence <- sample(1:K, size = 25, replace = TRUE) # select 25 random actions
payoffs <- banditGenerator(actionSequence)
df <- data.frame(action = actionSequence, payoff = payoffs)
df
## action payoff
## 1 2 3.8159985
## 2 2 0.7586613
## 3 1 -8.3533680
## 4 2 2.4643047
## 5 1 -7.0206883
## 6 1 -8.3729508
## 7 1 -6.8577509
## 8 2 2.8216237
## 9 2 2.7562503
## 10 2 2.4509950
## 11 2 2.4083578
## 12 1 -7.0019557
## 13 2 1.9492492
## 14 2 2.4573833
## 15 2 2.4558480
## 16 1 -7.0476604
## 17 2 3.4755511
## 18 2 0.7164596
## 19 2 0.2416400
## 20 2 2.9891610
## 21 2 2.4674153
## 22 2 2.6085539
## 23 2 3.6877424
## 24 2 3.2287656
## 25 1 -7.6778111
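Even from these 25 noisy samples, a quick summary already hints at which option is better. Since the tidyverse is loaded, we can group the choices by action and compare the observed mean payoffs (these are noisy estimates of meanVec, so the exact numbers will vary from run to run):
#Summarize observed payoffs per option (noisy estimates of the true means in meanVec)
df %>%
  group_by(action) %>%
  summarise(meanPayoff = mean(payoff), nChoices = n())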
Let’s now visualize the environment, where we use a large number of samples (n=10000) to paint an accurate picture of the true generative reward structure (vertical density plot), and plot that against a small number (n=25) of sampled payoffs (dots):
source('codeSnippets/geom_flat_violin.R')
# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots
rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting
ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
geom_jitter(data = subset(sampleDF, nsample<=plotsamples), aes(y = payoff, color = factor(option)), width = 0.2)+ #Plot only a subset of points to show how well a limited number of samples approximates the true distribution
ylab('Payoff')+
xlab('Option')+
theme_classic() +
scale_fill_viridis_d()+
scale_color_viridis_d()+
ggtitle('Payoff conditions')+
theme(legend.position ='none')
Scaling up
Of course, a 2-armed bandit is quite a simple learning environment. But we can very easily scale up the number of options. How many trials do you think it would take you to identify the best option below?
#define a larger set of options
K <- 10
meanVec <- runif(n=K, min = -10, max = 10) #Payoff means, which we sample from a uniform distribution between -10 and 10
sigmaVec <- rep(1, K)
# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots
rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting
ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
geom_jitter(data = subset(sampleDF, nsample<=plotsamples), aes(y = payoff, color = factor(option)), width = 0.2)+ #Plot only a subset of points to show how well a limited number of samples approximates the true distribution
ylab('Payoff')+
xlab('Option')+
theme_classic() +
scale_fill_viridis_d()+
scale_color_viridis_d()+
ggtitle('Payoff conditions')+
theme(legend.position ='none')
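If you want to check your intuition against the ground truth, compare the true best option (from meanVec) with the option that looks best from only the 25 plotted samples per arm; with more options, a small sample is more likely to point at the wrong arm:
#Compare the true best option with the apparent best from a limited number of samples
which.max(meanVec) #true best option
fewSamples <- subset(sampleDF, nsample <= plotsamples) #the small subset of samples shown as dots
sampleMeans <- tapply(fewSamples$payoff, fewSamples$option, mean) #observed mean payoff per option
which.max(sampleMeans) #apparent best option based on the limited samples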
Exercise 1.1. Can you modify the payoff-generating function to describe other tasks? Sometimes rewards are binary, such as success or failure outcomes when publishing a paper or hunting for food. With binary reward structures, you can use a Bernoulli distribution in place of a Gaussian distribution. The code snippet below contains a brief example.
K <- 10
successVec = runif(K) #different success probabilities for each option sampled from a uniform distribution between 0 and 1
binaryGenerator <- function(k) {#k is an integer or a vector of integers, selecting one of the 1:K arms of the bandit
payoff <- rbinom(n = length(k), size = 1, prob = successVec[k]) #a Bernoulli distribution (rbinom with size = 1) describes a binary 0/1 reward structure; this avoids purrr's deprecated rbernoulli()
return (payoff)
}
binaryGenerator(sample(1:10, 25, replace = TRUE))
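As a quick sanity check (a sketch using the binaryGenerator just defined), we can estimate each option's success probability from a large number of samples and compare the estimates against the true values in successVec:
#Estimate each option's success probability from many Bernoulli samples
nsim <- 10000 #samples per option
estimatedP <- sapply(1:K, FUN = function(k) mean(binaryGenerator(rep(k, nsim)))) #proportion of successes per option
round(rbind(true = successVec, estimated = estimatedP), 2) #estimates should closely match the true probabilities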
Live demonstrations
Now it’s time to dive in and experience learning and exploration for yourself!
Individual learning
We will start with a simple individual learning version of the task, where you’ll each be on your own. Think about how you learn which options provide the best rewards and how you navigate the explore-exploit dilemma.
A 16-armed bandit task.
The demo game is a 16-armed bandit task. Suppose you are on a fishing boat, trying to catch as many fish as possible. The ocean is divided into 16 grid cells, and some cells will provide more fish than others. To make a decision, just click/tap one of the cells.
Social learning
Now, let’s switch to a social version of the task. Beyond simply learning from your own experiences, you can observe the actions of others and use that to inform your decisions. Here, we start with the simplest case where all players are interacting with the exact same reward function. Of course, this isn’t always the case in the real world, but it is the simplest starting point.
You will be able to observe the actions of other players, but not see the rewards they earn. Do you think that collectively, you will be able to identify the best option? When do you think social learning will outperform individual learning, and when would it fail?
Social learning experiment